Putting Visual Analytics into Practical Use
The purpose of this take-home exercise, to reveal the economic of the city of Engagement, Ohio USA by using appropriate static and interactive statistical graphics methods.
For the analysis, we will explore both the Participants attributes and their financial background. We will first extract required dataset from the Participants.csv and subsequently perform a join to merge the two csv files.
We will explore the following analysis:
We need to install any new packages we plan to load in an R markdown document. The following libraries will be incorporated to construct the required interactive charts:
To install and load the new packages every time we load the R project, we will execute the following code before commencing on the analysis and constructing of graphs.
packages = c('tidyverse', 'ggrepel', 'ggiraph', 'patchwork', 'plotly', 'hrbrthemes',
'heatmaply', 'viridis','ggridges','binr', "trelliscopejs","zoo","ggthemes")
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The dataset provided was retrieved from Vast Challenge 2022 and it
consist of multiple csv files. For this exercise, we will only focus on
two csv files mainly: Participants.csv and
FinancialJournal.csv. To import the csv files, we will use
the read_csv command instead of read.csv
command to safeguard the integrity of the data.
financeJ <- readRDS(file = "data/financeJ.rds")
participantsD <- readRDS(file = "data/participantsD.rds")
head (financeJ)
# A tibble: 6 x 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
head (participantsD)
# A tibble: 6 x 7
participantId householdSize haveKids age educationLevel
<dbl> <dbl> <lgl> <dbl> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege
2 1 3 TRUE 25 HighSchoolOrCollege
3 2 3 TRUE 35 HighSchoolOrCollege
4 3 3 TRUE 21 HighSchoolOrCollege
5 4 3 TRUE 43 Bachelors
6 5 3 TRUE 32 HighSchoolOrCollege
# ... with 2 more variables: interestGroup <chr>, joviality <dbl>
To bin the participants age group evenly, the binr
library is used. I set the target bins at 7 to generate 7 different
grouping for the age group and subsequently create a vector to input a
new age group column inside the data table.
cuts <- bins(participantsD$age,target.bins = 7,minpts = 120)
cuts$breaks <- bins.getvals(cuts)
cuts$binct
[18, 23] [24, 29] [30, 35] [36, 42] [43, 48] [49, 54] [55, 60]
136 137 154 153 136 156 139
age_group1 <- 18:23 #create a vector of age
age_group2 <- 24:29
age_group3 <- 30:35
age_group4 <- 36:42
age_group5 <- 43:48
age_group6 <- 49:54
age_group7 <- 55:60
participants_final <- participantsD %>%
mutate (agegroup= case_when(
age %in% age_group1 ~ "18-23",
age %in% age_group2 ~ "24-29",
age %in% age_group3 ~ "30-35",
age %in% age_group4 ~ "36-42",
age %in% age_group5 ~ "43-48",
age %in% age_group6 ~ "49-54",
age %in% age_group7 ~ "55-60")) %>%
select (participantId,agegroup)
To have the agegroup in the finance data table, I use the
merge function and set the participantId as the key for the
merging. I then rename the column and create a year month timestamp
using the zoo library.
Expenses is calculated using the different category (Education, Shelter, Food and Recreation). We will first pivot the table to insert them as column before mutating the expenses column.
data_plot <- final %>%
select(c("Timestamp","Participant_ID", "category","amount")) %>% #choose the columns to subset
group_by(Timestamp,category,Participant_ID)%>%
summarise(amount = round(sum(amount),2)) %>%
pivot_wider(names_from = "category",values_from = "amount")
data_plot[is.na(data_plot)] = 0
data_plot [3:8] <- abs(data_plot[3:8])
data_plot <- data_plot %>%
mutate(Expenses = (Education + Food + Recreation + Shelter))
We will examine the participants Wage and Expenses individually using
the interactive line and dot plot against all the different months.
Trelliscope is used due to the large amount of participant
to examine. From the chart, we are able to identify at least 10%
(131) of the participants did not include their expenses after
the first input in Mar 2022. The highest expenses for most participants
occured in Mar 2022.
trelis_plot <- data_plot %>%
select(c("Timestamp","Participant_ID","Expenses", "Wage"))
ggplot(trelis_plot, aes(y = Expenses, x = (Timestamp))) +
geom_point() + geom_line() +
labs (y = "Total Expenses \n of the Month", x = 'Month', title = "Monthly Expenditure of Participant") +
scale_x_date(date_breaks = "months",
date_labels = "%b") +
facet_trelliscope(~ Participant_ID, nrow = 2, ncol = 2, width = 800, as_plotly = TRUE, path = "trellis/")
We will then examine the density ridge plot for Wage and Expenses by the different months to identify a trend. From the plot, we observe that the bottom quantile (0-5%) of Wage did not have any significant different throughout the year of recording. But the top 95% had a slight difference in the month of July and Oct. This might be due to the summer break in the United States that result in more youngster working in that period, increasing the Wage gained throughout the year. For the expenses, there is no significant observation throughout the year. We will zoom in later using the heatmap to identify the different Age Group expenditure.
trelis_plot$Timestamp <- as.Date(trelis_plot$Timestamp,format="%Y/%m/%d")
trelis_plot$timestamp <- strftime(trelis_plot$Timestamp, format="%b-%Y")
trelis_plot %>%
arrange(desc(lubridate::my(timestamp))) %>%
ggplot(aes(x = Expenses, y = fct_inorder(timestamp), fill = stat(quantile))) +
stat_density_ridges(scale = 2, quantile_lines = TRUE,
calc_ecdf = TRUE,
geom = "density_ridges_gradient",
quantiles = c(0.05, 0.5, 0.95)) +
scale_fill_manual(name = "Prob.", values = c("#E2FFF2", "white", "white", "#B0E0E6"),
labels = c("(0, 5%]", "(5%, 50%]", "(50%, 95%]","(95%, 1]")) +
coord_cartesian(clip = "off") + # to avoid clipping of the very top of the top ridgeline
theme_ridges()
trelis_plot %>%
arrange(desc(lubridate::my(timestamp))) %>%
ggplot(aes(x = Wage, y = fct_inorder(timestamp), fill = stat(quantile))) +
stat_density_ridges(scale = 2, quantile_lines = TRUE,
calc_ecdf = TRUE,
geom = "density_ridges_gradient",
quantiles = c(0.05, 0.5, 0.95)) +
scale_fill_manual(name = "Prob.", values = c("#E2FFF2", "white", "white", "#B0E0E6"),
labels = c("(0, 5%]", "(5%, 50%]", "(50%, 95%]","(95%, 1]")) +
coord_cartesian(clip = "off") + # to avoid clipping of the very top of the top ridgeline
theme_ridges()
Based on the histo-dot plot, we can see that at least 5% of participants do not have at least 20% of savings at the end of the year based on the guideline provided. More than 10% of the participants spend more than 50% of their income on Needs (Shelter + Education + Food) which is more than the 50% guideline provided. This could signify a high standard of living in that area which require the population to spend more than 50% of their income on expenses that are essentials.
histo_plot <- final %>%
select(c("Participant_ID","category","amount","agegroup")) %>%
group_by(Participant_ID,category,agegroup) %>%
summarise (amount = round(sum(amount),2)) %>%
pivot_wider(names_from = "category",values_from = "amount")
histo_plot[is.na(histo_plot)] = 0
histo_plot [3:8] <- abs(histo_plot[3:8])
histo_plot <- histo_plot %>%
mutate(Expenses = Education + Food + Recreation + Shelter) %>%
mutate(Income = RentAdjustment + Wage) %>%
mutate (Savings = Income - Expenses) %>%
mutate (PctSavings = round((Savings/Income) *100,0)) %>%
mutate(PctNeedsExpenses = round(((Education+Food+Shelter)/Income) *100,0))
p <- ggplot(data=histo_plot,
aes(x = PctSavings)) +
geom_vline(aes(xintercept = 20.5), color="red", linetype = "dashed", alpha = 0.3) +
annotate ("text", x=9, y = 0.5, label = "Participants with less than \n20% of Savings", size = 2.5, angle = 0,) +
geom_dotplot_interactive(
aes(tooltip = agegroup,
data_id = agegroup),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL, breaks = NULL) +
scale_x_continuous(labels = c("0","20","40","60", "80", "100"),breaks = c(0,20,40,60,80,100)) +
theme_few() + labs (title = "Participants Savings in Percentage", x = "Savings (%)")
p2 <- ggplot(data=histo_plot,
aes(x = PctNeedsExpenses)) +
geom_vline(aes(xintercept = 49.5), color="blue", linetype = "dashed", alpha = 0.3) +
annotate ("text", x=70, y = 0.5, label = "Participants that spend more than \n50% of income on Needs", size = 2.5, angle = 0,) +
geom_dotplot_interactive(
aes(tooltip = agegroup,
data_id = agegroup),
stackgroups = TRUE,
binwidth = 1,
method = "histodot") +
scale_y_continuous(NULL, breaks = NULL) +
scale_x_continuous(labels = c("0","20","40","60", "80", "100"),breaks = c(0,20,40,60,80,100), limits = c(0,100)) +
theme_few() + labs (title = "Participants Expenses (Needs) in Percentage", x = "Expenses (%)")
girafe(
code = print (p/p2),
width_svg = 8,
height_svg = 8)
heat_plot <- final %>%
select(c("timestamp","agegroup", "category","amount")) %>% #choose the columns to subset
group_by(timestamp,category,agegroup)%>%
summarise(amount = round(sum(amount),2)) %>%
pivot_wider(names_from = "category",values_from = "amount")
heat_plot[is.na(heat_plot)] = 0
heat_plot [3:8] <- abs(heat_plot[3:8])
heat_plot <- heat_plot %>%
mutate(Expenses = (Education + Food + Recreation + Shelter))
heat_plot$Weekdays <- weekdays(heat_plot$timestamp)
heat_plot$Month <- format(heat_plot$timestamp, "%b")
heat_plot_final <- heat_plot %>%
select(c("Month", "Weekdays", "agegroup", "Recreation", "Food")) %>%
group_by(Month,Weekdays,agegroup)%>%
summarise(Recreation = round(mean(Recreation),2), Food = round(mean(Food),2) )
Through the heatmap of recreation expenses, we see that there is an increased expenditure of Recreation Activities in the Month of Feburary, October, Novemember and December. But for Food Expenditure, Tuesday Saturday and Sunday have more expenditure throughout the year than the different months.
p <- ggplot(heat_plot_final, aes(x=Weekdays,y=Month)) +
geom_tile(aes (fill = Recreation)) +
scale_fill_gradient(low="lightblue1",high="darkblue",trans="log10") +
labs(title = "Average Recreation Expenditure of Participants by Month and Day",
x = "Weekdays",
y = "Month",
fill = "Recreation") +
theme_bw() +
theme(text = element_text(family = 'Fira Sans'))
fig <- ggplotly(p)
fig
x <- ggplot(heat_plot_final, aes(x=Weekdays,y=Month)) +
geom_tile(aes (fill = Food)) +
scale_fill_gradient(low="lightblue1",high="darkblue",trans="log10") +
labs(title = "Average Food Expenditure of Participants by Month and Day",
x = "Weekdays",
y = "Month",
fill = "Food") +
theme_bw() +
theme(text = element_text(family = 'Fira Sans'))
fig <- ggplotly(x)
fig